fix: parse pgvector embeddings before cosine rescoring #175
leonardsellem wants to merge 2 commits into garrytan:master
Conversation
- change health check url to a more stable endpoint (`users/by/username/X`)
- update auth label for clarity
Two related Postgres-string-typed-data bugs that PGLite hid:
1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254):
${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again
on the wire, storing JSONB columns as quoted string literals. Every
frontmatter->>'key' returned NULL on Postgres-backed brains; GIN
indexes were inert. Switched to sql.json(value), which is the
postgres.js-native JSONB encoder (Parameter with OID 3802).
Affected columns: pages.frontmatter, raw_data.data,
ingest_log.pages_updated, files.metadata. page_versions.frontmatter
is downstream via INSERT...SELECT and propagates the fix.
2. pgvector embeddings returning as strings (utils.ts):
getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of
Float32Array on Supabase, producing [NaN] cosine scores.
Adds parseEmbedding() helper handling Float32Array, numeric arrays,
and pgvector string format. Throws loudly on malformed vectors
(per Codex's no-silent-NaN requirement); returns null for
non-vector strings (treated as "no embedding here"). rowToChunk
delegates to parseEmbedding.
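The behavior described in item 2 can be sketched as follows. This is a hypothetical reconstruction, not the shipped helper in src/core/utils.ts — the name and contract come from the commit message, the internals are assumptions:

```typescript
// Sketch of a parseEmbedding() with the described contract: accept
// Float32Array, numeric arrays, and pgvector text format; throw loudly on
// malformed vectors; return null for non-vector strings ("no embedding here").
function parseEmbedding(value: unknown): Float32Array | null {
  if (value == null) return null;
  if (value instanceof Float32Array) return value;
  if (Array.isArray(value)) {
    const out = new Float32Array(value.length);
    for (let i = 0; i < value.length; i++) {
      const n = value[i];
      if (typeof n !== "number" || Number.isNaN(n)) {
        throw new Error(`malformed embedding element at index ${i}`);
      }
      out[i] = n;
    }
    return out;
  }
  if (typeof value === "string") {
    const trimmed = value.trim();
    // pgvector text format: "[0.1,0.2,...]"
    if (!trimmed.startsWith("[") || !trimmed.endsWith("]")) return null;
    const parts = trimmed.slice(1, -1).split(",");
    const out = new Float32Array(parts.length);
    for (let i = 0; i < parts.length; i++) {
      const n = Number(parts[i]);
      if (Number.isNaN(n)) throw new Error(`malformed vector component: ${parts[i]}`);
      out[i] = n;
    }
    return out;
  }
  throw new Error(`unsupported embedding shape: ${typeof value}`);
}

console.log(parseEmbedding("[0.1,0.2]") instanceof Float32Array); // true
console.log(parseEmbedding("not a vector"));                      // null
```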
E2E regression test at test/e2e/postgres-jsonb.test.ts asserts
jsonb_typeof = 'object' AND col->>'k' returns expected scalar across
all 5 affected columns — the test that should have caught the original
bug. Runs in CI via the existing pgvector service.
Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix)
Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
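The double-encode mechanism from item 1 can be reproduced without a database. A minimal sketch, using plain JSON round-trips to stand in for the postgres.js wire encode and the server-side decode (no real driver involved):

```typescript
// A JSONB parameter travels as text. With the buggy interpolation the value is
// stringified by the caller AND serialized again on the wire, so Postgres
// receives a JSON *string literal* instead of a JSON object.
const frontmatter = { title: "Meeting notes", tags: ["pmo"] };

// Buggy path: ${JSON.stringify(value)}::jsonb — one stringify too many.
const buggyWire = JSON.stringify(JSON.stringify(frontmatter));
const buggyStored = JSON.parse(buggyWire);
console.log(typeof buggyStored); // "string" — jsonb_typeof reports 'string',
                                 // so frontmatter->>'key' is NULL everywhere

// Fixed path: sql.json(value) serializes exactly once.
const fixedWire = JSON.stringify(frontmatter);
const fixedStored = JSON.parse(fixedWire);
console.log(typeof fixedStored); // "object" — ->> and GIN indexes work again
```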
Hey @leonardsellem — thank you. The Supabase NaN-score bug was a sharp catch. PR #196 (the v0.12.1 hotfix) bundles your fix with the JSONB double-encode work, since they're the same Postgres-string-typed-data class.

One small adjustment, per Codex's outside-voice review during planning: re-implemented rather than cherry-picked, because the file overlapped heavily with #187's rowToChunk update — a single coherent commit was cleaner. Co-authorship preserved.

Mind if I close this PR in favor of #196? Thanks for the fix.
…196)

* fix: splitBody and inferType for wiki-style markdown content

  - splitBody now requires an explicit timeline sentinel (`<!-- timeline -->`, `--- timeline ---`, or `---` directly before `## Timeline` / `## History`). A bare `---` in body text is a markdown horizontal rule, not a separator. This fixes the 83% content truncation @knee5 reported on a 1,991-article wiki where 4,856 of 6,680 wikilinks were lost.
  - serializeMarkdown emits the `<!-- timeline -->` sentinel for round-trip stability.
  - inferType extended with /writing/, /wiki/analysis/, /wiki/guides/, /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is most-specific-first, so projects/blog/writing/essay.md → writing, not project.
  - PageType union extended: writing, analysis, guide, hardware, architecture.

  Updates test/import-file.test.ts to use the new sentinel.

  Co-Authored-By: @knee5 (PR #187)
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: JSONB double-encode bug on Postgres + parseEmbedding NaN scores

  Two related Postgres-string-typed-data bugs that PGLite hid:

  1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254): ${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again on the wire, storing JSONB columns as quoted string literals. Every frontmatter->>'key' returned NULL on Postgres-backed brains; GIN indexes were inert. Switched to sql.json(value), the postgres.js-native JSONB encoder (Parameter with OID 3802). Affected columns: pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata. page_versions.frontmatter is downstream via INSERT...SELECT and propagates the fix.

  2. pgvector embeddings returning as strings (utils.ts): getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of Float32Array on Supabase, producing [NaN] cosine scores. Adds a parseEmbedding() helper handling Float32Array, numeric arrays, and pgvector string format.
  Throws loudly on malformed vectors (per Codex's no-silent-NaN requirement); returns null for non-vector strings (treated as "no embedding here"). rowToChunk delegates to parseEmbedding.

  An E2E regression test at test/e2e/postgres-jsonb.test.ts asserts jsonb_typeof = 'object' AND col->>'k' returns the expected scalar across all 5 affected columns — the test that should have caught the original bug. Runs in CI via the existing pgvector service.

  Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix)
  Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding)
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: extract wikilink syntax with ancestor-search slug resolution

  extractMarkdownLinks now handles [[page]] and [[page|Display Text]] alongside standard [text](page.md). For wiki KBs where authors omit the leading ../ (thinking in wiki-root-relative terms), resolveSlug walks ancestor directories until it finds a matching slug. Without this, wikilinks under tech/wiki/analysis/ targeting [[../../finance/wiki/concepts/foo]] silently dangled when the correct relative depth was 3 × ../ instead of 2.

  Co-Authored-By: @knee5 (PR #187)
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: gbrain repair-jsonb + v0.12.1 migration + CI grep guard

  - New gbrain repair-jsonb command. Detects rows where jsonb_typeof(col) = 'string' and rewrites them via (col #>> '{}')::jsonb across the 5 affected columns: pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata, page_versions.frontmatter. Idempotent — re-running is a no-op. PGLite engines short-circuit cleanly (the bug never affected the parameterized encode path PGLite uses). --dry-run shows what would be repaired; --json for scripting.
  - New v0_12_1.ts migration orchestrator. Phases: schema → repair → verify. Modeled on the v0_12_0 pattern, registered in migrations/index.ts. Runs automatically via gbrain upgrade / apply-migrations.
  - CI grep guard at scripts/check-jsonb-pattern.sh fails the build if anyone reintroduces the ${JSON.stringify(x)}::jsonb interpolation pattern. Wired into bun test via package.json. Best-effort static analysis (multi-line and helper-wrapped variants are caught by the E2E round-trip test instead).
  - Updates apply-migrations.test.ts expectations to account for the new v0.12.1 entry in the registry.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.12.1)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.12.1

  - CLAUDE.md: document repair-jsonb command, v0_12_1 migration, splitBody sentinel contract, inferType wiki subtypes, CI grep guard, new test files (repair-jsonb, migrations-v0_12_1, markdown)
  - README.md: add gbrain repair-jsonb to the ADMIN command reference
  - INSTALL_FOR_AGENTS.md: fix verification count (6 -> 7), add v0.12.1 upgrade guidance for Postgres brains
  - docs/GBRAIN_VERIFY.md: add check #8 for JSONB integrity on Postgres-backed brains
  - docs/UPGRADING_DOWNSTREAM_AGENTS.md: add v0.12.1 section with migration steps, splitBody contract, wiki subtype inference
  - skills/migrate/SKILL.md: document native wikilink extraction via gbrain extract links (v0.12.1+)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

--------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parseEmbedding() throws on structural corruption — right call for ingest/migrate paths where silent skips would be data loss. Wrong call for search/rescore paths, where one corrupt row in 10K would kill every query that touches it.

tryParseEmbedding() wraps parseEmbedding in try/catch: returns null on any shape that would throw, warns once per session so the bad row is visible in logs. Use it anywhere we'd rather degrade ranking than blow up the whole query.

Retrofit postgres-engine.getEmbeddingsByChunkIds (the #175 slice call site) — the 5-line rescore loop was the direct motivator. Keep the throwing parseEmbedding() for everything else (pglite-engine rowToChunk, migrate-engine round-trips, ingest).
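The skip-and-warn wrapper described above can be sketched as follows. A minimal stand-in for the throwing parseEmbedding() is inlined so the sketch is self-contained — the real helpers live in src/core/utils.ts and their internals are assumptions here:

```typescript
// Minimal stand-in for the throwing parser (strings only; the real one also
// handles numeric arrays).
function parseEmbedding(value: unknown): Float32Array | null {
  if (value instanceof Float32Array) return value;
  if (typeof value === "string" && value.startsWith("[") && value.endsWith("]")) {
    const nums = value.slice(1, -1).split(",").map(Number);
    if (nums.some(Number.isNaN)) throw new Error("malformed vector");
    return Float32Array.from(nums);
  }
  if (typeof value === "string") return null; // non-vector string: no embedding here
  throw new Error("unsupported embedding shape");
}

// Availability-path sibling: degrade ranking instead of failing the whole query.
let warned = false;
function tryParseEmbedding(value: unknown): Float32Array | null {
  try {
    return parseEmbedding(value);
  } catch (err) {
    if (!warned) {
      warned = true; // warn once per session so the bad row stays visible in logs
      console.warn("skipping malformed embedding:", err);
    }
    return null;
  }
}

console.log(tryParseEmbedding("[0.1,0.2]")?.length); // 2
console.log(tryParseEmbedding("[1,bogus]"));         // null (warned, not thrown)
```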
…kilinks, orphans (#216)

* fix(sync): remove nested transaction that deadlocks > 10 file syncs

  sync.ts wraps the add/modify loop in engine.transaction(), and each importFromContent inside opens another one. PGLite's _runExclusiveTransaction is a non-reentrant mutex — the second call queues on the mutex the first is holding, and the process hangs forever in ep_poll. Reproduced with a 15-file commit: unpatched hangs, patched runs in 3.4s. The fix drops the outer wrap; per-file atomicity is correct anyway (one file's failure should not roll back the others).

  (cherry picked from commit 4a1ac00)

* test(sync): regression guard for #132 top-level engine.transaction wrap

  Reads src/commands/sync.ts verbatim and asserts no uncommented engine.transaction() call appears above the add/modify loop. Protects against silent reintroduction of the nested-mutex deadlock that hung > 10-file syncs forever in ep_poll.

* feat(utils): tryParseEmbedding() skip+warn sibling for availability path

  parseEmbedding() throws on structural corruption — right call for ingest/migrate paths where silent skips would be data loss. Wrong call for search/rescore paths where one corrupt row in 10K would kill every query that touches it.

  tryParseEmbedding() wraps parseEmbedding in try/catch: returns null on any shape that would throw, warns once per session so the bad row is visible in logs. Use it anywhere we'd rather degrade ranking than blow up the whole query.

  Retrofit postgres-engine.getEmbeddingsByChunkIds (the #175 slice call site) — the 5-line rescore loop was the direct motivator. Keep the throwing parseEmbedding() for everything else (pglite-engine rowToChunk, migrate-engine round-trips, ingest).

* postgres-engine: scope search statement_timeout to the transaction

  searchKeyword and searchVector run on a pooled postgres.js client (max: 10 by default).
  The original code bounded each search with

    await sql`SET statement_timeout = '8s'`
    try { await sql`<query>` } finally { await sql`SET statement_timeout = '0'` }

  but every tagged template is an independent round-trip that picks an arbitrary connection from the pool. The SET, the query, and the reset could all land on DIFFERENT connections. In practice the GUC sticks to whichever connection ran the SET and then gets returned to the pool — the next unrelated caller on that connection inherits the 8s timeout (clipping legitimate long queries) or the reset-to-0 (disabling the guard for whoever expected it). A crash in the middle leaves the state set permanently.

  Wrap each search in sql.begin(async sql => …). postgres.js reserves a single connection for the transaction body, so the SET LOCAL, the query, and the implicit COMMIT all run on the same connection. SET LOCAL scopes the GUC to the transaction — COMMIT or ROLLBACK restores the previous value automatically, regardless of the code path out. Error paths can no longer leak the GUC.

  No API change. Timeout value and semantics are identical (8s cap on search queries; no effect on embed --all / bulk import, which runs outside these methods). Only one transaction per search — BEGIN + COMMIT round-trips are negligible next to a ranked FTS or pgvector query.

  Also closes the earlier audit finding R4-F002, which reported the same pattern on searchKeyword. This PR covers both searchKeyword and searchVector, so the pool-leak class is fully closed.

  Tests (test/postgres-engine.test.ts, new file):
  - No bare SET statement_timeout remains after stripping comments.
  - searchKeyword and searchVector each wrap their query in sql.begin.
  - Both use SET LOCAL.
  - Neither explicitly clears the timeout with SET statement_timeout=0.

  Source-level guardrails keep the fast unit suite DB-free.
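The pooled-connection pitfall this commit fixes can be modeled without a database. This is a toy (a hypothetical ToyPool with round-robin checkout standing in for postgres.js's arbitrary connection pickup), not the real driver:

```typescript
// Toy model: two pooled connections; each standalone statement checks out the
// next one. A session-level SET sticks to whichever connection ran it.
type Conn = { id: number; statementTimeoutMs: number };

class ToyPool {
  private conns: Conn[] = [
    { id: 0, statementTimeoutMs: 0 },
    { id: 1, statementTimeoutMs: 0 },
  ];
  private next = 0;

  // Standalone statement: lands on an arbitrary connection (round-robin here).
  run(stmt: string): Conn {
    const conn = this.conns[this.next];
    this.next = (this.next + 1) % this.conns.length;
    if (stmt.startsWith("SET statement_timeout")) conn.statementTimeoutMs = 8000;
    return conn;
  }

  // Transaction: one reserved connection for every statement in the body;
  // SET LOCAL state is discarded at COMMIT (modeled by restoring it).
  begin<T>(body: (run: (stmt: string) => Conn) => T): T {
    const conn = this.conns[this.next];
    this.next = (this.next + 1) % this.conns.length;
    const saved = conn.statementTimeoutMs;
    const result = body((stmt) => {
      if (stmt.startsWith("SET LOCAL statement_timeout")) conn.statementTimeoutMs = 8000;
      return conn;
    });
    conn.statementTimeoutMs = saved; // COMMIT/ROLLBACK restores the GUC
    return result;
  }
}

// Buggy pattern: the SET and the query land on different connections.
const pool = new ToyPool();
const setConn = pool.run("SET statement_timeout = '8s'");
const queryConn = pool.run("SELECT ...ranked FTS...");
console.log(setConn.id !== queryConn.id); // true — the 8s cap missed the query

// Fixed pattern: SET LOCAL + query share one connection; nothing leaks after.
const capped = new ToyPool().begin((run) => {
  run("SET LOCAL statement_timeout = '8s'");
  return run("SELECT ...ranked FTS...").statementTimeoutMs;
});
console.log(capped); // 8000 — the cap applied where the query actually ran
```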
  Live Postgres coverage of the search path is in test/e2e/search-quality.test.ts, which continues to exercise these methods end-to-end against pgvector when DATABASE_URL is set.

  (cherry picked from commit 6146c3b)

* feat(orphans): add gbrain orphans command for finding under-connected pages

  Surfaces pages with zero inbound wikilinks. Essential for content enrichment cycles in KBs with 1000+ pages. By default filters out auto-generated pages, raw sources, and pseudo-pages where no inbound links are expected; --include-pseudo to disable. Supports text (grouped by domain), --json, --count outputs. Also exposed as the find_orphans MCP operation.

  Tests cover basic detection, filtering, all output modes.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
  (cherry picked from commit f50954f)

* feat(extract): support Obsidian wikilinks + wiki-style domain slugs in canonical extractor

  extractEntityRefs now recognizes both syntaxes equally:

    [Name](people/slug)    -- upstream original
    [[people/slug|Name]]   -- Obsidian wikilink (new)

  Extends DIR_PATTERN to include domain-organized wiki slugs used by Karpathy-style knowledge bases:
  - entities (legacy prefix some brains keep during migration)
  - projects (gbrain canonical, was missing from the regex)
  - tech, finance, personal, openclaw (domain-organized wiki roots)

  Before this change, a 2,100-page brain with wikilinks throughout extracted zero auto-links on put_page because the regex only matched markdown-style [name](path). After: 1,377 new typed edges on a single extract --source db pass over the same corpus.

  Matches the behavior of the extract.ts filesystem walker (which already handled wikilinks as of the wiki-markdown-compat fix wave), so the db and fs sources now produce the same link graph from the same content. Both patterns share the DIR_PATTERN constant, so adding a new entity dir only requires updating one string.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  (cherry picked from commit 1cfb156)

* feat(doctor): jsonb_integrity + markdown_body_completeness detection

  Add two v0.12.1-era reliability checks to `gbrain doctor`:

  - `jsonb_integrity` scans the 4 known write sites from the v0.12.0 double-encode bug (pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata) and reports rows where jsonb_typeof(col) = 'string'. The fix hint points at `gbrain repair-jsonb` (the standalone repair command shipped in v0.12.1).
  - `markdown_body_completeness` flags pages whose compiled_truth is <30% of the raw source content length when the raw has multiple H2/H3 boundaries. Heuristic only; suggests `gbrain sync --force` or `gbrain import --force <slug>`.

  Also adds test/e2e/jsonb-roundtrip.test.ts — the regression coverage that should have caught the original double-encode bug. Hits all four write sites against real Postgres and asserts jsonb_typeof='object' plus `->>'key'` returns the expected scalar.

  Detection only: doctor diagnoses, `gbrain repair-jsonb` treats. No overlap with the standalone repair path.

* chore: bump to v0.12.3 + changelog (reliability wave)

  Master shipped v0.12.1 (extract N+1 + migration timeout) and v0.12.2 (JSONB double-encode + splitBody + wiki types + parseEmbedding) while this wave was mid-flight. Ships the remaining pieces as v0.12.3:

  - sync deadlock (#132, @sunnnybala)
  - statement_timeout scoping (#158, @garagon)
  - Obsidian wikilinks + domain patterns (#187 slice, @knee5)
  - gbrain orphans command (#187 slice, @knee5)
  - tryParseEmbedding() availability helper
  - doctor detection for jsonb_integrity + markdown_body_completeness

  No schema, no migration, no data touch.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: update project documentation for v0.12.3

  CLAUDE.md:
  - Add src/commands/orphans.ts entry
  - Expand src/commands/doctor.ts with v0.12.3 jsonb_integrity + markdown_body_completeness check descriptions
  - Update src/core/link-extraction.ts to mention Obsidian wikilinks + extended DIR_PATTERN (entities/projects/tech/finance/personal/openclaw)
  - Update src/core/utils.ts to mention the tryParseEmbedding sibling
  - Update src/core/postgres-engine.ts to note statement_timeout scoping + tryParseEmbedding usage in getEmbeddingsByChunkIds
  - Add "Key commands added in v0.12.3" section (orphans, doctor checks)
  - Add test/orphans.test.ts and test/postgres-engine.test.ts; update descriptions for test/sync.test.ts, test/doctor.test.ts, test/utils.test.ts
  - Add test/e2e/jsonb-roundtrip.test.ts with a note on intentional overlap
  - Bump operation count from ~36 to ~41 (find_orphans shipped in v0.12.3)

  README.md:
  - Add gbrain orphans to the ADMIN commands block

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

--------

Co-authored-by: sunnnybala <dhruvagarwal5018@gmail.com>
Co-authored-by: Gustavo Aragon <gustavoraularagon@gmail.com>
Co-authored-by: Clevin Canales <clevin@Clevins-MacBook-Pro.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Clevin Canales <clev.canales@gmail.com>
Summary
Fix a Postgres/Supabase search bug where `gbrain query` could render `[NaN]` scores after cosine rescoring.

The root cause was that `getEmbeddingsByChunkIds()` could return pgvector values as strings (for example `"[0.1, 0.2, ...]"`) instead of `Float32Array`s. Those string values were then passed into cosine similarity math, which produced `NaN` scores.

This change adds a shared embedding parser and uses it in the Postgres path before cosine rescoring.
Changes
- `parseEmbedding()` in `src/core/utils.ts`
- `parseEmbedding()` in `rowToChunk()`
- `parseEmbedding()` in `PostgresEngine.getEmbeddingsByChunkIds()`
- `test/utils.test.ts`

Why this is needed
Before this fix, hybrid search itself was returning valid RRF scores, but the Postgres/Supabase cosine rescoring step converted them into `NaN` because the embedding type coming back from the DB was not normalized.
After this fix, query results render normal numeric scores again and semantic
ranking works as expected on the Postgres engine.
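The failure mode is visible in plain JavaScript arithmetic: indexing a string yields characters, and multiplying a character by a number is `NaN`, which then propagates through the whole dot product. A minimal demonstration (the `dot` helper is illustrative, not the engine's actual rescoring code):

```typescript
// An unparsed pgvector string poisons cosine scoring: string[i] is a
// character, character * number is NaN, and NaN propagates through the sum.
function dot(a: ArrayLike<number>, b: ArrayLike<number>): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] as number) * (b[i] as number);
  return sum;
}

const query = new Float32Array([0.1, 0.2, 0.3]);
// What the DB handed back on Supabase: a string, typed away as ArrayLike.
const fromDb = "[0.1,0.2,0.3]" as unknown as ArrayLike<number>;

console.log(Number.isNaN(dot(query, fromDb))); // true — the [NaN] score
console.log(Number.isNaN(dot(query, query)));  // false — parsed vectors are fine
```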
Test Plan
Run the targeted unit tests:
`bun test test/utils.test.ts test/search.test.ts`

Manual verification against a real Supabase-backed brain:
`gbrain query "what happened in ayor pmo meetings"`

Before this fix, the query output showed `[NaN]` scores. After this fix, the query output shows normal numeric scores like `[0.8315]`.